Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion

نویسنده

  • Mika Timonen
چکیده

This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms for three tasks: (1) document categorization, which aims to classify documents such as tweets into categories, (2) keyword extraction, which aims to identify and extract the most important words of a document, and (3) keyword association modeling, which aims to identify links between keywords and use them for query expansion. As the focus of text mining is shifting toward datasets that hold usergenerated content, for example, social media, the type of data used in the text mining research is changing. The main characteristic of this data is its shortness. For example, a user status update usually contains less than 20 words. When using short documents, the biggest challenge in term weighting comes from the fact that most words of a document occur only once within the document. This is called hapax legomena and we call it Term Frequency = 1, or TF=1 challenge. As many traditional feature weighting approaches, such as Term Frequency Inverse Document Frequency, are based on the occurrence frequency of each word within a document, these approaches do not perform well with short documents. The first contribution of this thesis is a term weighting approach for doc-

منابع مشابه

Domain Keyword Extraction Technique: a New Weighting Method Based on Frequency Analysis

On-line text documents rapidly increase in size with the growth of World Wide Web. To manage such a huge amount of texts,several text miningapplications came into existence. Those applications such as search engine, text categorization, summarization, and topic detection are based on feature extraction.It is extremely time consuming and difficult task to extract keyword or feature manually.So a...

متن کامل

Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

Due to the increasing web, there are many challenges to establish a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, But the problem ...

متن کامل

Interactive Query Expansion with Automatically Generated Category-Specific Thesauri∗

The categorization of documents into subjectspecific categories is a useful enhancement for large document collections addressed by information retrieval systems, as a user can first browse a category tree in search of the category that best matches her interests, and then issue a query for more specific documents “from within the category”. This approach combines two modalities in information ...

متن کامل

Combining Qualitative and Quantitative Keyword Extraction Methods with Document Layout Analysis

The large availability of documents in digital format posed the problem of efficient and effective retrieval mechanisms. This involves the ability to process natural language, which is a significantly complex task. Traditional algorithms based on term matching between the document and the query, although efficient, are not able to catch the intended meaning of both, and hence cannot ensure effe...

متن کامل

Web Information Retrieval using WordNet

Information retrieval (IR) is the area of study concerned with searching documents or information within documents. The user describes information needs with a query which consists of a number of words. Finding weight of a query term is useful to determine the importance of a query. Calculating term importance is fundamental aspect of most information retrieval approaches and it is traditionall...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

متن کامل
عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012